Abstract



Majority vote plays a fundamental role in many applications of statistics, 
such as ensemble classifiers, crowdsourcing, and elections. When using ma- 
jority vote as a prediction rule, it is of basic interest to ask "How many 
votes are needed to obtain a reliable prediction?" In the context of binary 
classification with Random Forests or Bagging, we give a precise answer: If
$\mathrm{err}_t$ denotes the test error achieved by the majority vote of $t \ge 1$
classifiers, and $\mathrm{err}^*$ denotes its nominal limiting value, then under basic
regularity conditions, $\mathrm{err}_t = \mathrm{err}^* + \frac{c}{t} + o(\frac{1}{t})$,
where $c$ is a constant given by a simple formula. More generally, we show that if
$V_1, V_2, \ldots$ is an exchangeable Bernoulli sequence with mixture distribution $F$,
and the majority vote is written as $M_t = \mathrm{median}(V_1, \ldots, V_t)$, then
$1 - \mathbb{E}[M_t] = F(\frac{1}{2}) + \frac{1}{8} F''(\frac{1}{2}) \frac{1}{t} + o(\frac{1}{t})$
when $F$ is sufficiently smooth.

1 Introduction



Majority vote is a core principle for aggregating decisions. At an abstract level,
votes are a statistical resource, which may be obtained for a cost, such as
computation, communication, or time. As more votes are collected, the majority vote is
typically more likely to select the "correct" candidate, but at a higher cost. This
trade-off leads to the basic statistical problem of determining the smallest number
of votes needed to make a reliable decision.

An important instance of this general problem arises in the context of ensemble
methods for binary classification. Well-known examples of ensemble methods include
Bagging, Boosting, and Random Forests (Breiman, 1996, 2001; Freund and Schapire,
1995). The connection between voting and ensemble methods arises in the following
way. Given a fixed set of $N_0$ labeled training examples
$\mathcal{D} = \{(X_j, Y_j)\}_{j=1}^{N_0}$ in a sample space $\mathcal{X} \times \{0, 1\}$,
an algorithm is used to train an ensemble of $t \ge 1$ base classifiers
$Q_i : \mathcal{X} \to \{0, 1\}$, $i = 1, \ldots, t$, which are often randomized.
The predictions of the base classifiers are then aggregated by a particular rule,
with majority vote being the standard choice for Bagging and Random Forests.
More concretely, if a test point $(X, Y)$ is sampled from $\mathcal{X} \times \{0, 1\}$
with $Y$ being unknown, then the prediction of the whole ensemble is given by the
median of the predicted labels $Q_1(X), \ldots, Q_t(X)$. We denote the test error by
$\mathrm{err}_t = \mathbb{P}(\mathrm{median}(Q_1(X), \ldots, Q_t(X)) \ne Y \mid \mathcal{D})$,
and always assume that $t$ is odd to eliminate the issue of ties.
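
The aggregation rule described above can be sketched in a few lines of Python. This is only an illustration: the names `majority_vote` and `ensemble_predict`, and the toy threshold classifiers, are hypothetical and are not taken from any particular library.

```python
from typing import Callable, Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return the majority label among an odd number of 0/1 votes."""
    t = len(labels)
    if t % 2 == 0:
        raise ValueError("t must be odd so that ties cannot occur")
    # For 0/1 labels and odd t, the median equals 1 exactly when
    # strictly more than half of the votes are equal to 1.
    return int(2 * sum(labels) > t)


def ensemble_predict(classifiers: Sequence[Callable[[float], int]], x: float) -> int:
    """Aggregate the base predictions Q_1(x), ..., Q_t(x) by majority vote."""
    return majority_vote([q(x) for q in classifiers])


# Three toy base classifiers on the real line (hypothetical thresholds).
toy_ensemble = [
    lambda x: int(x > 0.2),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.8),
]
```

For instance, `ensemble_predict(toy_ensemble, 0.6)` aggregates the votes 1, 1, 0 of the three toy classifiers.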

For many ensemble methods, the test error $\mathrm{err}_t$ typically decreases and then
stabilizes as the number of base classifiers becomes large ($t \to \infty$). Likewise, the
nominal limiting value, say $\mathrm{err}^*$, is viewed as a target level of performance.
As $\mathrm{err}_t$ approaches $\mathrm{err}^*$, we also pay an increasing computational
cost, since each base classifier must be separately trained, stored in memory, and
evaluated for new predictions. Furthermore, the cost is often compounded by the need to
carry out the entire procedure for a variety of different tuning parameters. Consequently,
it is natural to select the smallest number $t^*$ such that
$|\mathrm{err}_{t^*} - \mathrm{err}^*|$ is less than a given tolerance, which amounts to
determining the convergence rate of $\mathrm{err}_t$. This is the problem we aim to solve
in the present paper, with particular emphasis on the methods of Bagging and Random Forests.

Despite the close connection between the convergence rate of $\mathrm{err}_t$ and the
computational cost of an ensemble, this issue has received little attention in the
literature for Bagging and Random Forests. Regarding the more distinct method of
Boosting, a substantial amount of work has been done to analyze its rate of convergence
(Bickel et al., 2006; Mukherjee et al., 2011; Schapire, 2010; Zhang and Yu, 2005).
However, the aggregation rule used in Boosting is very different from ordinary
majority vote, since the Boosting algorithm iteratively reweights the votes
as the ensemble is grown. Our work here focuses on the majority voting rule when
reweighting does not occur, and our results are not comparable to convergence
rates for Boosting.



The most closely related work that we are aware of is a paper by Ng and Jordan
(2001), which analyzes the convergence rate of the so-called "voting Gibbs
classifier", a Bayesian ensemble method that generates labels $Q_1(X), \ldots, Q_t(X)$
from a posterior distribution, and then aggregates them via majority vote. The error
measure studied by Ng and Jordan is essentially a Bayesian version of $\mathrm{err}_t$,
and they prove that its convergence rate is at most $O(\frac{1}{t})$ under basic
regularity conditions. With regard to the problem of choosing $t$, their analysis
techniques do not seem to be generally applicable, since they do not specify a constant
or the exact rate of convergence. Apart from that paper, we are not aware of similar
results for other ensemble methods involving majority vote.

In this paper, our primary contribution is a formula for $\mathrm{err}_t$ that is exact to
order $\frac{1}{t}$. The formula is applicable to any ensemble method based on the majority
vote of an exchangeable sequence of labels (cf. Assumption 1 below). In fact, the
statement of our result in Theorem 1 extends beyond the context of classification,
and may be relevant to aggregation problems in other areas, such as recommender
systems, online markets, or social choice theory (Easley and Kleinberg, 2010).
Given that many voting models are analyzed under the restrictive assumption
of i.i.d. votes, our much weaker assumption of exchangeability also lends
itself to applications. Further discussion of exchangeable voting models in social
choice theory may be found in Berg (1993); Ladha (1993); Zaigraev and Kaniovski
(2012).



The statement of our main result is given with proof in the following section.
Technical lemmas are proved in Section 3.






2 Main results 



To define the test error in precise terms, there are three sources of randomness
to consider: the training set $\mathcal{D}$, the randomized base classifiers $Q_i(\cdot)$,
and the test point $(X, Y)$. For the purposes of our analysis, the randomness in
$\mathcal{D}$ will play no role, and all of our probability statements will be conditional
on $\mathcal{D}$. Even though $\mathcal{D}$ is viewed as fixed, the functions $Q_i(\cdot)$
may depend on additional randomization in the training algorithm. For instance, the
Bagging and Random Forests algorithms draw a random subset of $\mathcal{D}$ to train each
$Q_i(\cdot)$. Lastly, the test point $(X, Y)$ is sampled as a pair from
$\mathcal{X} \times \{0, 1\}$, independently of $\mathcal{D}$ and the functions $Q_i(\cdot)$.
Altogether, if we let $Q = (Q_1, \ldots, Q_t)$ and write the joint distribution of
$(X, Y, Q)$ as $\mathbb{P}_{(X,Y,Q)}$, then we define

$$\mathrm{err}_t := \mathbb{P}_{(X,Y,Q)}\big(\mathrm{median}(Q_1(X), \ldots, Q_t(X)) \ne Y \mid \mathcal{D}\big). \tag{1}$$

The subscript on $\mathbb{P}_{(X,Y,Q)}$ will be omitted from now on.

The main technical challenge of studying $\mathrm{err}_t$ arises from the correlation
structure of the labels $Q_1(X), Q_2(X), \ldots$, which may be very complex in general.
Nevertheless, in the cases of Random Forests and Bagging, the correlation structure
is constrained by the fact that the labels form an exchangeable Bernoulli sequence
(cf. de Finetti's Theorem (Billingsley, 2012, Theorem 35.10)). In particular, Random
Forests and Bagging obey the following assumption.

Assumption 1. The sequence of binary labels $Q_1(X), Q_2(X), \ldots$ is conditionally
i.i.d., given the test point $(X, Y)$ and the training data $\mathcal{D}$.

In fact, this condition can be found in implicit form in the seminal Random Forests
paper of Breiman (2001, Definition 1.1), and it has been used elsewhere as an
abstract definition of Random Forests (Biau et al., 2008).



By restricting our attention to the class of ensemble methods that satisfy
Assumption 1, our analysis of $\mathrm{err}_t$ can be reduced to the study of exchangeable
Bernoulli sequences in the following way. First, if we define $\pi_0 := \mathbb{P}(Y = 0)$
and $\pi_1 := \mathbb{P}(Y = 1)$ as the class proportions, then $\mathrm{err}_t$ may be
decomposed as a weighted sum of class-wise error rates

$$\begin{aligned}
\mathrm{err}_t ={}& \pi_0\, \mathbb{P}\big(\mathrm{median}(Q_1(X), \ldots, Q_t(X)) = 1 \mid \mathcal{D}, Y = 0\big) \\
&+ \pi_1\, \mathbb{P}\big(\mathrm{median}(Q_1(X), \ldots, Q_t(X)) = 0 \mid \mathcal{D}, Y = 1\big).
\end{aligned} \tag{2}$$

Next, we define $\{U_i\}$, $i = 1, 2, \ldots$ to be the sequence of predicted labels for a
random test point drawn from the negative class,

$$\{U_i\} \overset{\mathcal{L}}{=} \{Q_i(X)\} \mid \{\mathcal{D}, Y = 0\}. \tag{3}$$

Similarly, we define $\{\tilde{U}_i\}$ to be the sequence of predicted labels for a random
test point drawn from the positive class,

$$\{\tilde{U}_i\} \overset{\mathcal{L}}{=} \{Q_i(X)\} \mid \{\mathcal{D}, Y = 1\}. \tag{4}$$

It is clear from Assumption 1 that the sequences $\{U_i\}$ and $\{\tilde{U}_i\}$ are
exchangeable. When $t$ is odd, we may also write

$$\mathrm{err}_t = \pi_0\, \mathbb{E}[\mathrm{median}(U_1, \ldots, U_t)] + \pi_1 \big(1 - \mathbb{E}[\mathrm{median}(\tilde{U}_1, \ldots, \tilde{U}_t)]\big). \tag{5}$$

Of course, there is no formal distinction between the two terms on the right hand 
side (apart from labeling). Therefore, it is natural to state our main result in terms 
of the more basic object at hand: the running median of a generic exchangeable 
Bernoulli sequence. This also serves to emphasize that our result is broadly ap- 
plicable to other voting scenarios, and is not limited to ensemble methods. 

We now fix some remaining notation for the statement of Theorem 1. Recall
from de Finetti's theorem that if $V_1, V_2, \ldots$ is an exchangeable Bernoulli sequence,
then there exists a random variable $\Theta$ taking values in the interval $[0, 1]$, such
that the variables $V_1, V_2, \ldots$ are conditionally i.i.d. Bernoulli($\theta$), given
$\Theta = \theta$. If we let $F : [0, 1] \to [0, 1]$ denote the distribution function of
the variable $\Theta$, then we refer to $F$ as the mixture distribution for the sequence
$\{V_i\}$.
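
As a rough illustration of the de Finetti representation recalled above, the following Python sketch first draws $\Theta$ from a mixture distribution, then generates conditionally i.i.d. votes, and finally estimates by Monte Carlo the probability that the majority vote equals 0. The helper names, the number of replications, and the choice $\Theta \sim \mathrm{Beta}(1, 2)$ are assumptions made only for this example.

```python
import random


def sample_exchangeable_votes(t, draw_theta, rng):
    """Draw Theta, then t conditionally i.i.d. Bernoulli(Theta) votes."""
    theta = draw_theta(rng)
    return [1 if rng.random() < theta else 0 for _ in range(t)]


def estimate_majority_zero(t, draw_theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of P(majority vote = 0) for odd t."""
    if t % 2 == 0:
        raise ValueError("t must be odd")
    rng = random.Random(seed)
    zeros = 0
    for _ in range(n_rep):
        votes = sample_exchangeable_votes(t, draw_theta, rng)
        if 2 * sum(votes) < t:  # fewer than half of the votes equal 1
            zeros += 1
    return zeros / n_rep


def beta_1_2(rng):
    """Illustrative mixture distribution: Theta ~ Beta(1, 2)."""
    return rng.betavariate(1.0, 2.0)


estimate = estimate_majority_zero(11, beta_1_2)
```

The votes produced this way are dependent (they share the same $\Theta$) but exchangeable, which is the only structure the analysis relies on.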

Theorem 1. Let $V_1, V_2, \ldots$ be an exchangeable Bernoulli sequence with mixture
distribution $F$, and let $M_t = \mathrm{median}(V_1, \ldots, V_t)$. Suppose $F$ is twice
continuously differentiable on $[0, 1]$. Then as $t \to \infty$,

$$1 - \mathbb{E}[M_t] = F(\tfrac{1}{2}) + \tfrac{1}{8} F''(\tfrac{1}{2}) \tfrac{1}{t} + o(\tfrac{1}{t}). \tag{6}$$
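
To see the statement of Theorem 1 at work, one can compare the expansion (6) with an exact computation in a case where everything is available in closed form. The sketch below assumes $\Theta \sim \mathrm{Beta}(a, b)$, so that the number of votes equal to 1 is beta-binomial; this choice and the function names are illustrative assumptions and are not part of the analysis. For $\mathrm{Beta}(2, 4)$, the distribution function $F$ is a polynomial with $F(\frac{1}{2}) = \frac{13}{16}$ and $F''(\frac{1}{2}) = -5$.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def one_minus_EM(t, a, b):
    """Exact 1 - E[M_t] = P(V_1 + ... + V_t <= (t - 1) / 2), Theta ~ Beta(a, b), t odd."""
    if t % 2 == 0:
        raise ValueError("t must be odd")
    total = 0.0
    for k in range((t - 1) // 2 + 1):
        # Beta-binomial probability of exactly k votes equal to 1.
        log_term = (
            math.log(math.comb(t, k))
            + log_beta(a + k, b + t - k)
            - log_beta(a, b)
        )
        total += math.exp(log_term)
    return total


def theorem_approx(t, F_half, F2_half):
    """Right-hand side of (6) without the o(1/t) remainder."""
    return F_half + F2_half / (8.0 * t)


def scaled_gap(t, a, b, F_half):
    """t * (1 - E[M_t] - F(1/2)), which should approach F''(1/2) / 8."""
    return t * (one_minus_EM(t, a, b) - F_half)
```

With these definitions, `scaled_gap(t, 2, 4, 13 / 16)` can be evaluated for increasing odd `t` and compared with the predicted limit $\frac{1}{8} F''(\frac{1}{2}) = -\frac{5}{8}$.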

In order to extract the convergence rate of $\mathrm{err}_t$ from the theorem, a few more
pieces of notation are needed. We denote the mixture distributions of the predicted
label sequences $\{U_i\}$ and $\{\tilde{U}_i\}$ by $G$ and $\tilde{G}$ (respectively), and
we identify them with their distribution functions from $[0, 1]$ to $[0, 1]$. We also
define the constants

$$c := \tfrac{\pi_1}{8} \tilde{G}''(\tfrac{1}{2}) - \tfrac{\pi_0}{8} G''(\tfrac{1}{2}), \tag{7}$$

$$\mathrm{err}^* := \pi_0 \big(1 - G(\tfrac{1}{2})\big) + \pi_1 \tilde{G}(\tfrac{1}{2}). \tag{8}$$

Due to the relation (5), our formula for $\mathrm{err}_t$ is obtained directly from
Theorem 1. Note also that the convergence $\mathrm{err}_t \to \mathrm{err}^*$ follows from
our result, and is not an assumption.

Corollary 1. Suppose $G$ and $\tilde{G}$ are twice continuously differentiable on $[0, 1]$.
Then as $t \to \infty$,

$$\mathrm{err}_t = \mathrm{err}^* + \tfrac{c}{t} + o(\tfrac{1}{t}). \tag{9}$$
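
Corollary 1 suggests a simple rule of thumb for choosing the ensemble size: if the $o(\frac{1}{t})$ remainder is ignored, then $|\mathrm{err}_t - \mathrm{err}^*|$ is approximately $|c|/t$, so a tolerance $\varepsilon$ is met once $t \ge |c|/\varepsilon$. The Python sketch below encodes this heuristic. It assumes that a value (or estimate) of $c$ is already available; how such a value would be obtained in practice is not addressed by the sketch, and `smallest_odd_t` is a hypothetical helper name.

```python
import math


def smallest_odd_t(c, tol):
    """Smallest odd t with |c| / t <= tol, ignoring the o(1/t) remainder."""
    if tol <= 0:
        raise ValueError("tol must be positive")
    t = max(1, math.ceil(abs(c) / tol))
    # The analysis assumes an odd number of votes, so round up to an odd value.
    return t if t % 2 == 1 else t + 1
```

Because the remainder is dropped, the returned value is best read as a first guess for $t^*$ rather than a guarantee.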

In light of the corollary, it is natural to wonder what meaning the functions $G$
and $\tilde{G}$ have in terms of the classification problem. The idea may be explained in
the following way. If $x \in \mathcal{X}$ is any fixed point in the sample space, then we
define the function $\vartheta : \mathcal{X} \to [0, 1]$ by

$$\vartheta(x) := \mathbb{E}[Q_1(x) \mid \mathcal{D}]. \tag{10}$$

Hence, if $x$ is a point that should be labeled with "1", then $\vartheta(x)$ represents
the average accuracy of the ensemble at that point. (Note that the variables
$Q_1(x), Q_2(x), \ldots$ are i.i.d.) Having defined $\vartheta$, it is easy to see that $G$
and $\tilde{G}$ are the distribution functions of the following random variables,

$$G \overset{\mathcal{L}}{=} \vartheta(X) \mid \{\mathcal{D}, Y = 0\}, \tag{11}$$

$$\tilde{G} \overset{\mathcal{L}}{=} \vartheta(X) \mid \{\mathcal{D}, Y = 1\}. \tag{12}$$

Viewing $\vartheta$ as an "accuracy function" on $\mathcal{X}$, our differentiability
assumptions on $G$ and $\tilde{G}$ express the idea that there is a smooth transition
between regions of $\mathcal{X}$ that are easy to classify and regions that are difficult
to classify. Specifically, the existence of $G''(1/2)$ and $\tilde{G}''(1/2)$ implies that
the test points $(X, Y)$ assign their mass smoothly over the boundary of "ambiguous
points" where $\vartheta(x) = 1/2$. Note too that this smoothness condition depends on both
the ensemble algorithm, and on the way the test points $(X, Y)$ are distributed over
$\mathcal{X} \times \{0, 1\}$. In this sense, the functions $G$ and $\tilde{G}$ offer a
very compact way of encoding all of the information in the problem that is relevant to
$\mathrm{err}_t$.
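
To make the roles of $G$ and $\tilde{G}$ more concrete, the following Python sketch computes a plug-in version of the limiting error (8) from values of the accuracy function. It assumes that values $\vartheta(x)$ are available at a set of test points from each class, for example as the fraction of base classifiers voting "1" in a very large ensemble; the function name and input format are hypothetical. The quantities $1 - G(\frac{1}{2})$ and $\tilde{G}(\frac{1}{2})$ are replaced by empirical proportions.

```python
def plug_in_err_star(theta_neg, theta_pos, pi0=None):
    """Plug-in estimate of err* = pi0 (1 - G(1/2)) + pi1 G~(1/2).

    theta_neg: values of the accuracy function at test points with Y = 0.
    theta_pos: values of the accuracy function at test points with Y = 1.
    pi0: proportion of the negative class (estimated from counts if None).
    """
    n0, n1 = len(theta_neg), len(theta_pos)
    if n0 == 0 or n1 == 0:
        raise ValueError("both classes need at least one test point")
    if pi0 is None:
        pi0 = n0 / (n0 + n1)
    pi1 = 1.0 - pi0
    # 1 - G(1/2): negative points where a vote of "1" has probability above 1/2.
    miss_neg = sum(v > 0.5 for v in theta_neg) / n0
    # G~(1/2): positive points where a vote of "1" has probability at most 1/2.
    miss_pos = sum(v <= 0.5 for v in theta_pos) / n1
    return pi0 * miss_neg + pi1 * miss_pos
```

Estimating the constant $c$ in the same spirit would additionally require the second derivatives $G''(\frac{1}{2})$ and $\tilde{G}''(\frac{1}{2})$, which this sketch does not attempt.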

We conclude this section by turning to the proof of Theorem 1. The main
technique involved is to represent $\mathbb{E}[M_t]$ in terms of a second order Edgeworth
expansion for the binomial distribution function. Although it is also possible to
study $\mathbb{E}[M_t]$ using a simple Hoeffding bound, that approach seems to lead to an
inferior rate of $O(\frac{1}{\sqrt{t}})$.
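
Proof of Theorem 1. We begin by writing

$$1 - \mathbb{E}[M_t] = \mathbb{P}\Big(\tfrac{1}{t} \sum_{i=1}^{t} V_i \le 1/2\Big) = \int_0^1 \mathbb{P}\Big(\tfrac{1}{t} \sum_{i=1}^{t} V_i \le 1/2 \,\Big|\, \Theta = \theta\Big)\, dF(\theta), \tag{13}$$

where we note that the integrand is a binomial distribution function. Our aim is
to evaluate the limit of the scaled difference

$$t\big(1 - \mathbb{E}[M_t] - F(1/2)\big) = \int_0^1 t \Big(\mathbb{P}\Big(\tfrac{1}{t} \sum_{i=1}^{t} V_i \le 1/2 \,\Big|\, \Theta = \theta\Big) - F(1/2)\Big)\, dF(\theta), \tag{14}$$

and show that the limit is equal to $\frac{1}{8} F''(\frac{1}{2})$.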

The first main portion of the proof involves reducing the last integral to a
more concrete form. If we let $\mathcal{E}_t$ denote the second order Edgeworth expansion
for the distribution function of Binomial($t$, $\theta$), then Lemma 1 in Section 3
guarantees the following uniform approximation for any $\theta \in (0, 1)$,

$$\sup_{z \in \mathbb{R}} \Big| \mathbb{P}\Big( \tfrac{\sqrt{t}}{\sqrt{\theta(1-\theta)}} \Big(\tfrac{1}{t} \sum_{i=1}^{t} V_i - \theta\Big) \le z \,\Big|\, \Theta = \theta \Big) - \mathcal{E}_t(z) \Big| = o(t^{-1}). \tag{15}$$

The boundary cases of $\theta = 0, 1$ will play no role, since the continuity of $F$ implies
that $\Theta$ hits the endpoints of $[0, 1]$ with zero probability. Using the uniformity of
the condition (15), and letting

$$z(\theta; t) := \frac{\sqrt{t}\,(1/2 - \theta)}{\sqrt{\theta(1-\theta)}}, \tag{16}$$

we may replace the probability $\mathbb{P}(\frac{1}{t} \sum_{i=1}^{t} V_i \le 1/2 \mid \theta)$
in line (13) with $\mathcal{E}_t(z(\theta; t))$. Hence,

$$t\big(1 - \mathbb{E}[M_t] - F(1/2)\big) = \int_0^1 t \big(\mathcal{E}_t(z(\theta; t)) - F(1/2)\big)\, dF(\theta) + o(1). \tag{17}$$

We also let $\varphi_t(z)$ denote the first and second order terms of the expansion,

$$\mathcal{E}_t(z) = \Phi(z) + \varphi_t(z), \tag{18}$$

which allows us to express the scaled difference $t(1 - \mathbb{E}[M_t] - F(1/2))$ as

$$\int_0^1 t \big(\Phi(z(\theta; t)) - F(1/2)\big)\, dF(\theta) + \int_0^1 t\, \varphi_t(z(\theta; t))\, dF(\theta) + o(1). \tag{19}$$

Surprisingly, the higher order terms $\varphi_t(z)$ give no contribution to the term
$\frac{1}{8} F''(\frac{1}{2}) \frac{1}{t}$ of the main formula (6). In particular, Lemma 3 in
Section 3 shows that the second integral $\int_0^1 t\, \varphi_t(z(\theta; t))\, dF(\theta)$
tends to 0 as $t \to \infty$. Consequently, it remains to consider the limit of
$\int_0^1 t (\Phi(z(\theta; t)) - F(1/2))\, dF(\theta)$.

In order to simplify the first integral in line (19), it is an essential element of
the proof to notice that $z(\theta; t)$ is a smooth and monotone function of
$\theta \in (0, 1)$, which may be inverted for any fixed $t$, giving

$$\theta = \theta(z; t) = \frac{1}{2} - \frac{z}{2\sqrt{t + z^2}}.$$

Changing variables from $\theta$ to $z$, and then integrating by parts, it follows that

$$\int_0^1 t \big(\Phi(z(\theta; t)) - F(1/2)\big)\, dF(\theta) = \int_{-\infty}^{\infty} t \big(F(\theta(z; t)) - F(1/2)\big)\, \phi(z)\, dz, \tag{20}$$

where $\phi$ denotes the standard normal density. We note that an extra minus sign
has been introduced because $z$ is a decreasing function of $\theta$. Since $\theta(0; t)$
is equal to $1/2$ for all $t$, a first order Taylor expansion at $z = 0$ gives

$$F(\theta(z; t)) - F(1/2) = F'(1/2)\, \theta'(0; t)\, z + R(z; t), \tag{21}$$

with $R(z; t)$ denoting the remainder. Using $\int z\, \phi(z)\, dz = 0$, the first order
term in line (21) vanishes upon integration for every $t$, giving

$$\int_0^1 t \big(\Phi(z(\theta; t)) - F(1/2)\big)\, dF(\theta) = \int_{-\infty}^{\infty} t\, R(z; t)\, \phi(z)\, dz. \tag{22}$$

We conclude the proof by determining the pointwise limit of $t R(z; t)$ and applying
the dominated convergence theorem. Writing the remainder $R(z; t)$ in Lagrange form,
we have

$$t R(z; t) = \tfrac{1}{2}\, t \Big( F''(\theta(\xi; t))\, \big[\theta'(\xi; t)\big]^2 + F'(\theta(\xi; t))\, \theta''(\xi; t) \Big) \cdot z^2 \tag{23}$$

for some $\xi$ between 0 and $z$. Using the continuity of $F''$ at $1/2$, as well as the
formulas

$$\theta'(z; t) = -\frac{t}{2 (t + z^2)^{3/2}} \tag{24}$$

and

$$\theta''(z; t) = \frac{3 t z}{2 (t + z^2)^{5/2}}, \tag{25}$$

we obtain the following pointwise limit for each fixed $z$ as $t \to \infty$,

$$t R(z; t) \to \tfrac{1}{8} F''(\tfrac{1}{2})\, z^2. \tag{26}$$

It is straightforward to check that $t R(z; t)$ is dominated by a fixed polynomial in
$z$, which is clearly integrable with respect to $\phi$. The details are given in Lemma 4
of Section 3. Consequently, the desired limit

$$\int_{-\infty}^{\infty} t R(z; t)\, \phi(z)\, dz \to \tfrac{1}{8} F''(\tfrac{1}{2})$$

follows from the formula $\int z^2 \phi(z)\, dz = 1$. □
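
The change of variables and the limit used at the end of the proof can be checked numerically. The Python sketch below codes the inverse map $\theta(z; t) = \frac{1}{2} - \frac{z}{2\sqrt{t + z^2}}$ together with its first two derivatives in $z$, and approximates the integral $\int t (F(\theta(z; t)) - F(1/2)) \phi(z)\, dz$ by the trapezoid rule. The example distribution function $F(\theta) = 2\theta - \theta^2$ (the Beta(1, 2) case, with $F''(\frac{1}{2}) = -2$), the grid settings, and all function names are assumptions chosen for illustration.

```python
import math


def theta_of_z(z, t):
    """Inverse of z(theta; t) = sqrt(t) (1/2 - theta) / sqrt(theta (1 - theta))."""
    return 0.5 - z / (2.0 * math.sqrt(t + z * z))


def dtheta(z, t):
    """First derivative of theta(z; t) with respect to z."""
    return -t / (2.0 * (t + z * z) ** 1.5)


def d2theta(z, t):
    """Second derivative of theta(z; t) with respect to z."""
    return 3.0 * t * z / (2.0 * (t + z * z) ** 2.5)


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def scaled_leading_term(F, t, half_width=12.0, n=24000):
    """Trapezoid approximation of the integral of t (F(theta(z;t)) - F(1/2)) phi(z)."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * t * (F(theta_of_z(z, t)) - F(0.5)) * phi(z)
    return total * h


def F_example(theta):
    """Distribution function of Beta(1, 2); its second derivative is -2."""
    return 2.0 * theta - theta * theta
```

For large `t`, the value of `scaled_leading_term(F_example, t)` should be close to $\frac{1}{8} F''(\frac{1}{2}) = -\frac{1}{4}$ for this choice of $F$.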



3 Technical lemmas 

The technical lemmas supporting Theorem 1 are based largely on the second order
Edgeworth expansion of the binomial distribution function. Since the binomial
distribution arises from a sum of lattice variables, the expansion includes a number of
terms that are absent from the usual expansion for continuous variables. The following
result has been adapted from a general theorem in Bhattacharya and Rao (1976,
Theorem 23.1), which handles multivariate lattice distributions. Additional
work is needed to obtain explicit formulas for the second order terms in their
expansion, but we omit these tedious calculations. Formulas for the second order
terms have also been given in Brown et al. (2002, Lemma 1), with the only difference
being that we have written the expansion in terms of Hermite and Bernoulli
polynomials to simplify the proof of Lemma 2 below.

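To state the first lemma, several pieces of notation are needed. We define the
parameter

$$p_t = p_t(\theta, z) := [\, t\theta + \sqrt{t}\, \sigma z \,], \tag{27}$$

where $\sigma^2 := \theta(1 - \theta)$ and the symbol $[a]$ denotes the fractional part of a
real number $a$. The first two Bernoulli polynomials are

$$B_1(p_t) = p_t - \tfrac{1}{2}, \qquad B_2(p_t) = p_t^2 - p_t + \tfrac{1}{6}, \tag{28}$$

and the first several Hermite polynomials are

$$\begin{aligned}
H_1(z) &= z \\
H_2(z) &= z^2 - 1 \\
H_3(z) &= z^3 - 3z \\
H_4(z) &= z^4 - 6z^2 + 3 \\
H_5(z) &= z^5 - 10z^3 + 15z.
\end{aligned} \tag{29}$$

Lastly, if $\kappa_i$ denotes the $i$th cumulant of the Bernoulli($\theta$) distribution,
then the "standardized cumulants" are given by

$$\frac{\kappa_3}{\sigma^3} = \frac{1 - 2\theta}{\sqrt{\theta(1-\theta)}} \quad \text{and} \quad \frac{\kappa_4}{\sigma^4} = \frac{1 - 6\theta(1-\theta)}{\theta(1-\theta)}. \tag{30}$$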
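
Lemma 1 (Bhattacharya and Rao, 1976). Let $W_1, W_2, \ldots$ be i.i.d. Bernoulli($\theta$)
variables with $\theta \in (0, 1)$. For any $z \in \mathbb{R}$ and $t \ge 1$, define

$$G_t(z) := \mathbb{P}\Big( \frac{1}{\sigma \sqrt{t}} \sum_{i=1}^{t} (W_i - \theta) \le z \Big). \tag{31}$$

Then,

$$\sup_{z \in \mathbb{R}} \big| G_t(z) - \mathcal{E}_t(z) \big| = o(t^{-1}), \tag{32}$$

where $\mathcal{E}_t$ is the second order Edgeworth expansion given by

$$\begin{aligned}
\mathcal{E}_t(z) ={}& \Phi(z) - \phi(z) \Big( \frac{\kappa_3}{6 \sigma^3 \sqrt{t}} H_2(z) + \frac{\kappa_4}{24 \sigma^4 t} H_3(z) + \frac{1}{72 t} \Big(\frac{\kappa_3}{\sigma^3}\Big)^2 H_5(z) \Big) \\
&- \phi(z) \Big[ \frac{1}{\sigma \sqrt{t}} B_1(p_t) + \frac{\kappa_3}{6 \sigma^4 t} H_3(z) B_1(p_t) + \frac{1}{2 \sigma^2 t} H_1(z) B_2(p_t) \Big].
\end{aligned} \tag{33}$$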



Remarks. We note that if the terms involving $B_1$ and $B_2$ were omitted, then
the expansion (33) would exactly match the well-known formula for the case of
continuous variables, which may be found in McCullagh and Nelder (1989, p. 474).
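
A quick numerical sanity check of an expansion of this type is to compare it with the exact binomial distribution function. The Python sketch below does this under explicit assumptions: the expansion is coded as $\Phi(z)$ minus $\phi(z)$ times a sum of "smooth" Hermite terms and "lattice" Bernoulli terms, with the coefficients spelled out in the code, and $p_t$ is read as the fractional part of $t\theta + \sqrt{t}\,\sigma z$. The function names and the evaluation grid (points strictly between lattice jumps) are illustrative choices, and the exact table is only intended for moderate $t$.

```python
import math


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def binomial_cdf_table(t, theta):
    """Return [P(S <= k) for k = 0..t] with S ~ Binomial(t, theta)."""
    table, running = [], 0.0
    for k in range(t + 1):
        running += math.comb(t, k) * theta ** k * (1.0 - theta) ** (t - k)
        table.append(running)
    return table


def edgeworth(z, t, theta):
    """Second order expansion; p_t is taken as a fractional part (assumption)."""
    sigma = math.sqrt(theta * (1.0 - theta))
    k3 = (1.0 - 2.0 * theta) / sigma          # kappa_3 / sigma^3
    k4 = (1.0 - 6.0 * sigma ** 2) / sigma ** 2  # kappa_4 / sigma^4
    x = t * theta + math.sqrt(t) * sigma * z
    p = x - math.floor(x)                     # fractional part p_t
    b1 = p - 0.5                              # Bernoulli polynomial B_1
    b2 = p * p - p + 1.0 / 6.0                # Bernoulli polynomial B_2
    h1, h2 = z, z * z - 1.0
    h3 = z ** 3 - 3.0 * z
    h5 = z ** 5 - 10.0 * z ** 3 + 15.0 * z
    rt = math.sqrt(t)
    smooth = k3 * h2 / (6.0 * rt) + k4 * h3 / (24.0 * t) + k3 ** 2 * h5 / (72.0 * t)
    lattice = (
        b1 / (sigma * rt)
        + k3 * h3 * b1 / (6.0 * sigma * t)
        + h1 * b2 / (2.0 * sigma ** 2 * t)
    )
    return Phi(z) - phi(z) * (smooth + lattice)


def max_gap(t, theta, offsets=(0.25, 0.5, 0.75)):
    """Largest |exact - expansion| over points strictly between lattice jumps."""
    sigma = math.sqrt(theta * (1.0 - theta))
    cdf = binomial_cdf_table(t, theta)
    worst = 0.0
    for k in range(t):
        for off in offsets:
            z = (k + off - t * theta) / (sigma * math.sqrt(t))
            worst = max(worst, abs(cdf[k] - edgeworth(z, t, theta)))
    return worst
```

Comparing `max_gap(t, theta)` with the corresponding gap for the plain normal approximation `Phi(z)` gives a feel for how much the correction terms contribute at a given ensemble size.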



The next lemma gives formulas for the higher order terms of the expansion (33)
under the change of variable $z = \frac{\sqrt{t}\,(1/2 - \theta)}{\sqrt{\theta(1-\theta)}}$.
In particular, it is possible to write the cumulants of Bernoulli($\theta$) as functions
of $z$ when this relationship is inverted. For simplicity, we have presented the formula
only for odd values of $t$. Apart from some added technical detail, the arguments that
rely on this lemma are no different in the case of even $t$.

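Lemma 2. When $z = \frac{\sqrt{t}\,(1/2 - \theta)}{\sqrt{\theta(1-\theta)}}$ and
$\theta \in (0, 1)$, the cumulants of Bernoulli($\theta$) satisfy

$$\frac{\kappa_3}{\sigma^3} = \frac{2z}{\sqrt{t}} \quad \text{and} \quad \frac{\kappa_4}{\sigma^4} = \frac{4 z^2}{t} - 2, \tag{34}$$

and the parameter $p_t$ defined in line (27) satisfies $p_t = [t/2]$. Also, for this choice
of $z$ and odd values of $t$, the function $\varphi_t(z) = \mathcal{E}_t(z) - \Phi(z)$ is
given by

$$\varphi_t(z) = \phi(z) \Big( -\frac{z}{3t} H_2(z) + \frac{z}{6t} + \frac{z^3}{6t^2} + \Big(\frac{1}{12t} - \frac{z^2}{6t^2}\Big) H_3(z) - \frac{z^2}{18 t^2} H_5(z) \Big). \tag{35}$$

Remarks. The proof of this lemma only involves algebraic manipulations and
is hence omitted. The main simplification that occurs for the case of odd $t$ is that
$B_1(p_t) = p_t - \frac{1}{2}$ vanishes identically (hence removing two terms from
line (33)).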
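
Since the proof of the lemma is purely algebraic, the cumulant identities are easy to spot-check numerically. The Python sketch below uses the standard Bernoulli cumulants $\kappa_3 = \sigma^2 (1 - 2\theta)$ and $\kappa_4 = \sigma^2 (1 - 6\sigma^2)$ with $\sigma^2 = \theta(1-\theta)$, and compares the standardized cumulants with $2z/\sqrt{t}$ and $4z^2/t - 2$ under the change of variable $z = \sqrt{t}\,(1/2 - \theta)/\sigma$. The function names are hypothetical.

```python
import math


def standardized_cumulants(theta):
    """Return (kappa_3 / sigma^3, kappa_4 / sigma^4) for Bernoulli(theta)."""
    var = theta * (1.0 - theta)
    k3 = var * (1.0 - 2.0 * theta)   # third cumulant
    k4 = var * (1.0 - 6.0 * var)     # fourth cumulant
    return k3 / var ** 1.5, k4 / var ** 2


def z_of_theta(theta, t):
    """Change of variable z = sqrt(t) (1/2 - theta) / sigma."""
    return math.sqrt(t) * (0.5 - theta) / math.sqrt(theta * (1.0 - theta))


def lemma2_gaps(theta, t):
    """Absolute differences between both sides of the two cumulant identities."""
    z = z_of_theta(theta, t)
    s3, s4 = standardized_cumulants(theta)
    return abs(s3 - 2.0 * z / math.sqrt(t)), abs(s4 - (4.0 * z * z / t - 2.0))


def lattice_argument(theta, t):
    """The quantity t*theta + sqrt(t)*sigma*z, which reduces to t / 2 for this z."""
    sigma = math.sqrt(theta * (1.0 - theta))
    return t * theta + math.sqrt(t) * sigma * z_of_theta(theta, t)
```

For odd `t`, the value returned by `lattice_argument` has fractional part $\frac{1}{2}$, which is why the terms involving $B_1$ drop out.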

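3.1 The higher order terms vanish as $t \to \infty$.

Lemma 3. Assume the conditions of Theorem 1 hold. If
$z(\theta; t) = \frac{\sqrt{t}\,(1/2 - \theta)}{\sqrt{\theta(1-\theta)}}$ and
$\varphi_t(z) = \mathcal{E}_t(z) - \Phi(z)$, then as $t \to \infty$,

$$\int_0^1 t\, \varphi_t(z(\theta; t))\, dF(\theta) \to 0. \tag{36}$$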

Proof. Changing variables from $\theta$ to $z$, and integrating by parts gives

$$\int_0^1 t\, \varphi_t(z(\theta; t))\, dF(\theta) = \int_{-\infty}^{\infty} t\, \varphi_t'(z)\, F(\theta(z; t))\, dz, \tag{37}$$

where it is simple to check that the boundary term vanishes for any fixed $t$ using
the formula (35). To consider the right side of line (37), it follows from the
formula (35) and the relation

$$\frac{d}{dz} \big[\phi(z) H_j(z)\big] = -\phi(z) H_{j+1}(z),$$

that the $O(\frac{1}{t})$ terms of $\varphi_t'(z)$ can be written in the form
$\frac{1}{t} \phi(z) H_i(z) H_j(z)$, $i \ne j$, (up to constants) where we note that
$H_0(z) = 1$ and $H_1(z) = z$. Since the functions $F(\theta(z; t))$ are uniformly bounded
by 1 and converge pointwise to the constant $F(1/2)$ as $t \to \infty$, the dominated
convergence theorem and the orthogonality of Hermite polynomials imply that the integral
$\int t\, \varphi_t'(z)\, F(\theta(z; t))\, dz$ converges to 0. □
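
The orthogonality property invoked at the end of the proof, namely $\int H_i(z) H_j(z) \phi(z)\, dz = 0$ for $i \ne j$, can be confirmed numerically. The Python sketch below builds the (probabilists') Hermite polynomials through the three-term recurrence $H_{k+1}(z) = z H_k(z) - k H_{k-1}(z)$ and integrates against the standard normal density with the trapezoid rule; the function names and grid settings are assumptions made for illustration.

```python
import math


def hermite(n, z):
    """Probabilists' Hermite polynomial H_n(z) via H_{k+1} = z H_k - k H_{k-1}."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, z
    for k in range(1, n):
        h_prev, h = h, z * h - k * h_prev
    return h


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def hermite_inner(i, j, half_width=12.0, n=24000):
    """Trapezoid approximation of the integral of H_i(z) H_j(z) phi(z)."""
    h = 2.0 * half_width / n
    total = 0.0
    for m in range(n + 1):
        z = -half_width + m * h
        weight = 0.5 if m in (0, n) else 1.0
        total += weight * hermite(i, z) * hermite(j, z) * phi(z)
    return total * h
```

In particular, `hermite_inner(0, j)` for `j >= 1` corresponds to integrating a single term $\phi(z) H_j(z)$, which is the situation that arises after $F(\theta(z; t))$ is replaced by its constant limit.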



3.2 The remainder is dominated. 

To simplify the statement and proof of the following lemma, we write
$a_t(z) \lesssim b_t(z)$ for two sequences of non-negative functions $a_t(z)$ and $b_t(z)$
if there is an absolute constant $c > 0$ such that $a_t(z) \le c\, b_t(z)$ for all
$t \ge 1$, and all $z \in \mathbb{R}$.

Lemma 4. Assume the conditions of Theorem 1 hold. Then, the remainder $R(z; t)$
in line (21) satisfies the bound

$$t\, |R(z; t)| \lesssim 1 + |z|^3. \tag{38}$$

Proof. From line (23) we have

$$t R(z; t) = \tfrac{1}{2}\, t \Big( F''(\theta(\xi; t))\, \big(\theta'(\xi; t)\big)^2 + F'(\theta(\xi; t))\, \theta''(\xi; t) \Big) \cdot z^2, \tag{39}$$

where $\xi$ is a number between 0 and $z$. By assumption, $F''$ and $F'$ are continuous
functions on $[0, 1]$. Consequently, the quantities $|F''(\theta(\xi; t))|$ and
$F'(\theta(\xi; t))$ are uniformly bounded by a constant. It suffices to show that
$t (\theta'(\xi; t))^2$ and $t |\theta''(\xi; t)|$ are dominated by fixed polynomials in
$z$ for all $\xi \in [-|z|, |z|]$ and all $t \ge 1$. Due to the formulas (24) and (25),
we have

$$t \big(\theta'(\xi; t)\big)^2 = t \Big( -\frac{t}{2 (t + \xi^2)^{3/2}} \Big)^2 \lesssim 1, \tag{40}$$

$$t\, |\theta''(\xi; t)| = \frac{3 t^2 |\xi|}{2 (t + \xi^2)^{5/2}} \lesssim |z|, \tag{41}$$

and the statement of the lemma follows easily. □
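
The two elementary bounds used in the proof can be spot-checked on a grid. The Python sketch below evaluates $t (\theta'(\xi; t))^2$ and $t |\theta''(\xi; t)|$ using the derivative formulas $\theta'(z; t) = -\frac{t}{2 (t + z^2)^{3/2}}$ and $\theta''(z; t) = \frac{3 t z}{2 (t + z^2)^{5/2}}$, and compares them with the explicit constants $\frac{1}{4}$ and $\frac{3}{2} |\xi|$. These particular constants, the grid, and the function names are choices made for this illustration only.

```python
def dtheta(z, t):
    """First derivative of theta(z; t) = 1/2 - z / (2 sqrt(t + z^2)) in z."""
    return -t / (2.0 * (t + z * z) ** 1.5)


def d2theta(z, t):
    """Second derivative of theta(z; t) in z."""
    return 3.0 * t * z / (2.0 * (t + z * z) ** 2.5)


def worst_violations(ts, xis):
    """Largest excess of t*theta'^2 over 1/4 and of t*|theta''| over 1.5*|xi|."""
    excess_first, excess_second = float("-inf"), float("-inf")
    for t in ts:
        for xi in xis:
            excess_first = max(excess_first, t * dtheta(xi, t) ** 2 - 0.25)
            excess_second = max(
                excess_second, t * abs(d2theta(xi, t)) - 1.5 * abs(xi)
            )
    return excess_first, excess_second


# Illustrative grid: a few values of t >= 1 and xi in [-10, 10].
grid_t = [1, 3, 11, 101, 1001]
grid_xi = [i / 10.0 for i in range(-100, 101)]
excess = worst_violations(grid_t, grid_xi)
```

Non-positive entries in `excess` indicate that neither bound is violated anywhere on the grid.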